Improve test coverage and report by skyw · Pull Request #119 · NVIDIA-NeMo/Emerging-Optimizers

skyw · 2026-03-06T00:40:17Z

Half of this was done by Claude Code; I reviewed the AI-written code.

Some tiny bugs are fixed along with it.

Signed-off-by: Hao Wu <skyw@nvidia.com>

copy-pr-bot · 2026-03-06T00:40:20Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

greptile-apps · 2026-03-06T00:48:48Z

Greptile Summary

This PR improves test coverage for SOAP utilities, scalar optimizers, and precondition schedules, while fixing two bugs: (1) empty Kronecker factor eigenbasis shape corrected from (0,) to (0, 0) in both get_eigenbasis_eigh and get_eigenbasis_qr, and (2) met_approx_eigvals_criteria condition rewritten to an equivalent but arguably clearer form. CI reporting is also improved with per-test XML output files.

Key changes:

soap_utils.py: torch.empty(0) → torch.empty(0, 0) for empty Kronecker factors in two places — correct fix.
eig.py: met_approx_eigvals_criteria formula is algebraically equivalent to the old form; also fixes the return docstring and a parameter name in orthogonal_iteration.
test_soap_utils.py: New tests for zero-dim eigenbasis, empty QR factor, and all_eigenbases_met_criteria (positive and negative cases). The positive case now correctly uses .eigenvectors. A minor device inconsistency exists: torch.empty(0, 0) in the dims=[64, 0, 32] parameterized case of test_get_eigenbasis_eigh is not given device=self.device, which produces a CPU tensor in an otherwise-CUDA list on GPU runs.
test_scalar_optimizers.py: AdEMAMix parameterization fixes (shadowed variable bugs removed), plus two new Lion update tests. The num_beta_slow_warmup_steps=2 variant in the AdEMAMix test doesn't exercise the slow-beta schedule path since alpha=0 completely nullifies the slow EMA contribution.
test_soap.py: Schedule tests reorganised into a new ScheduleTest class with CosineSchedule and StepSchedule cases added.
L0_Tests_CPU.sh: Per-test XML reports and a loop deduplicating the torchrun calls; test_soap.py and test_soap_utils.py are still not wired into CI.
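
The AdEMAMix remark in the list above can be checked with a scalar sketch. It assumes the commonly written AdEMAMix form (bias-corrected fast EMA, slow EMA scaled by alpha, Adam-style second moment). The function names, the linear slow-beta warmup, and the default hyperparameters are illustrative assumptions, not the signature of the repository's calculate_ademamix_update. With alpha=0 the slow EMA never reaches the update, so the warmup length cannot change the result:

```python
import math


def slow_beta(step, beta_start, beta_end, warmup_steps):
    """Hypothetical linear warmup of the slow beta (an assumption, not the repo's schedule)."""
    if warmup_steps <= 0:
        return beta_end
    frac = min(step / warmup_steps, 1.0)
    return beta_start + frac * (beta_end - beta_start)


def ademamix_updates(grads, alpha, warmup_steps, betas=(0.9, 0.999, 0.9999), eps=1e-8):
    """Scalar AdEMAMix-style updates for a sequence of gradients."""
    beta1, beta2, beta3 = betas
    m_fast = m_slow = v = 0.0
    out = []
    for step, g in enumerate(grads, start=1):
        m_fast = beta1 * m_fast + (1 - beta1) * g
        b3 = slow_beta(step, beta1, beta3, warmup_steps)
        m_slow = b3 * m_slow + (1 - b3) * g
        v = beta2 * v + (1 - beta2) * g * g
        m_hat = m_fast / (1 - beta1**step)  # bias-corrected fast EMA
        v_hat = v / (1 - beta2**step)  # bias-corrected second moment
        out.append((m_hat + alpha * m_slow) / (math.sqrt(v_hat) + eps))
    return out


def adam_updates(grads, beta1=0.9, beta2=0.999, eps=1e-8):
    """Scalar Adam updates, used as the reference."""
    m = v = 0.0
    out = []
    for step, g in enumerate(grads, start=1):
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        out.append((m / (1 - beta1**step)) / (math.sqrt(v / (1 - beta2**step)) + eps))
    return out


grads = [0.5, -1.0, 2.0, 0.25]
adam = adam_updates(grads)
zero_alpha_no_warmup = ademamix_updates(grads, alpha=0.0, warmup_steps=0)
zero_alpha_warmup = ademamix_updates(grads, alpha=0.0, warmup_steps=2)
mixed_no_warmup = ademamix_updates(grads, alpha=5.0, warmup_steps=0)
mixed_warmup = ademamix_updates(grads, alpha=5.0, warmup_steps=2)

print("alpha=0 matches Adam:", zero_alpha_no_warmup == adam)
print("alpha=0 ignores warmup:", zero_alpha_warmup == zero_alpha_no_warmup)
print("alpha=5 depends on warmup:", mixed_warmup != mixed_no_warmup)
```

A test that is meant to exercise the slow-beta schedule therefore needs a non-zero alpha; otherwise the warmup parameter is varied without affecting the quantity being asserted.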

Confidence Score: 4/5

Safe to merge; changes are test improvements and minor bug fixes with no risk to production optimizer logic.
The two production-code fixes (empty tensor shape and formula rewrite in met_approx_eigvals_criteria) are mathematically sound and well-tested by the new cases. The only actionable finding is a device inconsistency on torch.empty(0, 0) in the dims=[64, 0, 32] test case, which is harmless on CPU CI but would surface on CUDA runs.
tests/test_soap_utils.py (device inconsistency in empty-tensor parameterized case)

Comments Outside Diff (1)

tests/test_soap_utils.py, line 137 (link)

Missing device=self.device on empty tensor in parameterized test.

When the new {"dims": [64, 0, 32]} parameterized case is run on a CUDA device, torch.empty(0, 0) defaults to CPU while the dim=64 and dim=32 tensors are created on self.device (CUDA). The returned Q_list[1] will also be on CPU (since get_eigenbasis_eigh returns torch.empty(0, 0, device=kronecker_factor.device)), leaving a device-inconsistent output list. Any downstream consumer that feeds the full Q_list into a CUDA kernel would hit a device mismatch error.

Last reviewed commit: 1215f18

greptile-apps · 2026-03-06T00:49:00Z

tests/test_soap_utils.py

+    def test_all_eigenbases_met_criteria_true_eigenbasis_returns_true(self, N: int) -> None:
+        kronecker_factor_list = [torch.randn(N, N, device=self.device)]
+
+        eigenbasis_list = [torch.diag(torch.linalg.eigh(K).eigenvalues) for K in kronecker_factor_list]
+        self.assertTrue(soap_utils.all_eigenbases_met_criteria(kronecker_factor_list, eigenbasis_list))


Wrong eigh attribute used — .eigenvalues instead of .eigenvectors

torch.linalg.eigh returns a named tuple with .eigenvalues (1-D vector λ) and .eigenvectors (N×N orthogonal matrix Q). The test wraps the 1-D eigenvalues in torch.diag(), producing a diagonal eigenvalue matrix D, and passes that to all_eigenbases_met_criteria.

However, the conjugate function (used internally) assumes its second argument is an orthogonal matrix. Passing a diagonal eigenvalue matrix instead breaks this invariant. The met_approx_eigvals_criteria check will compute a meaningless result and likely pass by chance, so the test does not validate the intended mathematical property.

Additionally, K = torch.randn(N, N) on line 200 is not symmetric. torch.linalg.eigh assumes a symmetric (Hermitian) input and by default reads only the lower triangular part, so the result is not an eigendecomposition of K itself.

The test should construct a symmetric matrix and use the eigenvectors:

    def test_all_eigenbases_met_criteria_true_eigenbasis_returns_true(self, N: int) -> None:
        g = torch.randn(N, N, device=self.device)
        K_sym = g @ g.T + torch.eye(N, device=self.device) * 1e-5  # symmetric PSD
        kronecker_factor_list = [K_sym]
        eigenbasis_list = [torch.linalg.eigh(K_sym).eigenvectors]
        self.assertTrue(soap_utils.all_eigenbases_met_criteria(kronecker_factor_list, eigenbasis_list))
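
The difference between the two attributes can be seen on a 2×2 symmetric matrix without PyTorch. The sketch below is a pure-Python stand-in: eigh_2x2, conjugate, and off_diagonal_norm are hypothetical helpers written for this illustration, and it assumes that conjugating K by a basis Q means computing Qᵀ K Q. Conjugating by the eigenvector matrix leaves a diagonal result, while conjugating by a diagonal matrix of eigenvalues does not:

```python
import math


def matmul(a, b):
    """Plain nested-list matrix product."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]


def transpose(a):
    return [list(row) for row in zip(*a)]


def conjugate(k, q):
    """Q^T K Q (assumed meaning of 'conjugate' for this sketch)."""
    return matmul(matmul(transpose(q), k), q)


def off_diagonal_norm(m):
    """Frobenius norm of the off-diagonal entries."""
    return math.sqrt(sum(m[i][j] ** 2
                         for i in range(len(m)) for j in range(len(m)) if i != j))


def eigh_2x2(k):
    """Eigendecomposition of a symmetric 2x2 matrix via a single Jacobi rotation."""
    a, b, d = k[0][0], k[0][1], k[1][1]
    theta = 0.5 * math.atan2(2.0 * b, a - d)
    c, s = math.cos(theta), math.sin(theta)
    q = [[c, -s], [s, c]]  # columns are orthonormal eigenvectors
    rotated = conjugate(k, q)
    return [rotated[0][0], rotated[1][1]], q


k_sym = [[2.0, 1.0], [1.0, 2.0]]
eigenvalues, eigenvectors = eigh_2x2(k_sym)
diag_eigenvalues = [[eigenvalues[0], 0.0], [0.0, eigenvalues[1]]]

residual_with_eigenvectors = off_diagonal_norm(conjugate(k_sym, eigenvectors))
residual_with_eigenvalues = off_diagonal_norm(conjugate(k_sym, diag_eigenvalues))
print("off-diagonal norm using eigenvectors:", residual_with_eigenvectors)
print("off-diagonal norm using diag(eigenvalues):", residual_with_eigenvalues)
```

Any criterion based on how diagonal the conjugated factor is can only be expected to pass for the right reason when the basis comes from the eigenvectors of a symmetric factor.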

@mkhona-nvidia, should we use eigenvalues or eigenvectors?

skyw · 2026-03-06T01:16:46Z

/ok to test b4ea359

github-actions · 2026-03-06T01:41:04Z

Test Results

46 files +12, 96 suites +38, 1m 11s ⏱️ +2s
955 tests +38: 955 ✅ +38, 0 💤 ±0, 0 ❌ ±0
2 141 runs +316: 2 141 ✅ +316, 0 💤 ±0, 0 ❌ ±0

Results for commit 1215f18. ± Comparison against base commit 7056267.

This pull request removes 2 and adds 40 tests. Note that renamed tests count towards both.

Removed tests:

__main__.ScalarOptimizerTest ‑ test_calculate_ademamix_update_with_alpha_zero_equals_adam
__main__.SoapFunctionsTest ‑ test_soap_optimizer_class_based_schedule

Added tests:

__main__.DistributedNewtonSchulzCpuTest ‑ test_1step_close_to_non_distributed0 (shape=(3, 32))
__main__.DistributedNewtonSchulzCpuTest ‑ test_1step_close_to_non_distributed1 (shape=(5, 100))
__main__.DistributedNewtonSchulzCpuTest ‑ test_1step_with_partial_tp_close_to_non_distributed0 (shape=(32, 3), transpose=True, tp_size=2)
__main__.DistributedNewtonSchulzCpuTest ‑ test_1step_with_partial_tp_close_to_non_distributed1 (shape=(5, 100), transpose=False, tp_size=4)
__main__.DistributedNewtonSchulzCpuTest ‑ test_5steps_with_transpose_close_to_non_distributed0 (shape=(32, 3), transpose=True)
__main__.DistributedNewtonSchulzCpuTest ‑ test_5steps_with_transpose_close_to_non_distributed1 (shape=(5, 100), transpose=False)
__main__.DistributedNewtonSchulzCpuTest ‑ test_distributed_normalize_close_to_non_distributed0 (shape=(21, 16))
__main__.DistributedNewtonSchulzCpuTest ‑ test_distributed_normalize_close_to_non_distributed1 (shape=(16, 32))
__main__.DistributedNewtonSchulzStepCpuTest ‑ test_close_to_non_distributed0 (shape=(21, 16))
__main__.DistributedNewtonSchulzStepCpuTest ‑ test_close_to_non_distributed1 (shape=(16, 32))
…

♻️ This comment has been updated with latest results.

skyw · 2026-03-06T16:34:21Z

/ok to test 23e4b0c

skyw · 2026-03-06T17:14:32Z

/ok to test 0825b7e

skyw · 2026-03-06T17:26:50Z

/ok to test 1215f18

skyw added 8 commits March 5, 2026 14:23

collect report for CPU tests

a70ed89

Signed-off-by: Hao Wu <skyw@nvidia.com>

add test for empty dim

1ca40c3

Signed-off-by: Hao Wu <skyw@nvidia.com>

add test for all_eigenbases_met_criteria, fix bug

3b0fc7e

Signed-off-by: Hao Wu <skyw@nvidia.com>

use 2d empty as place holder for kronecker factor

216bad2

Signed-off-by: Hao Wu <skyw@nvidia.com>

fix criteria check bug and add test

2ffc244

Signed-off-by: Hao Wu <skyw@nvidia.com>

add more coverage for calculate_sim_ademamix_update

ceeaf64

Signed-off-by: Hao Wu <skyw@nvidia.com>

add more coverage for lion

66fefd5

Signed-off-by: Hao Wu <skyw@nvidia.com>

improve test for schedule functions

362c781

Signed-off-by: Hao Wu <skyw@nvidia.com>

skyw requested a review from a team as a code owner March 6, 2026 00:40

greptile-apps bot reviewed Mar 6, 2026

skyw and others added 2 commits March 5, 2026 16:52

Update tests/test_scalar_optimizers.py

7425047

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com> Signed-off-by: Hao Wu <skyw@users.noreply.github.com>

revert greptile mess

07659c4

Signed-off-by: Hao Wu <skyw@nvidia.com>

greptile-apps bot reviewed Mar 6, 2026

tests/test_scalar_optimizers.py (resolved)

tests/ci/L0_Tests_CPU.sh (resolved)

skip some flaky test

695ce68

Signed-off-by: Hao Wu <skyw@nvidia.com>

greptile-apps bot reviewed Mar 6, 2026

tests/test_soap_utils.py (resolved)

tests/test_scalar_optimizers.py (resolved)

fix eigenvector vs eigenvalue

b4ea359

Signed-off-by: Hao Wu <skyw@nvidia.com>

copy-pr-bot bot temporarily deployed to test March 6, 2026 01:17 Inactive

copy-pr-bot bot temporarily deployed to nemo-ci March 6, 2026 01:17 Inactive

greptile-apps bot reviewed Mar 6, 2026

tests/ci/L0_Tests_CPU.sh (resolved)

tests/test_soap_utils.py (resolved)

copy-pr-bot bot temporarily deployed to nemo-ci March 6, 2026 01:26 Inactive

skyw enabled auto-merge (squash) March 6, 2026 02:49

skyw requested a review from mkhona-nvidia March 6, 2026 03:29

Update tests/test_scalar_optimizers.py

23e4b0c

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com> Signed-off-by: Hao Wu <skyw@users.noreply.github.com>

copy-pr-bot bot temporarily deployed to test March 6, 2026 16:34 Inactive

copy-pr-bot bot temporarily deployed to nemo-ci March 6, 2026 16:35 Inactive

greptile-apps bot reviewed Mar 6, 2026

tests/test_scalar_optimizers.py (outdated, resolved)

copy-pr-bot bot had a problem deploying to nemo-ci March 6, 2026 16:41 Failure

fix AI error

0825b7e

Signed-off-by: Hao Wu <skyw@nvidia.com>

copy-pr-bot bot temporarily deployed to test March 6, 2026 17:14 Inactive

copy-pr-bot bot temporarily deployed to nemo-ci March 6, 2026 17:15 Inactive

greptile-apps bot reviewed Mar 6, 2026

tests/test_soap.py (outdated, resolved)

Update tests/test_soap.py

1215f18

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com> Signed-off-by: Hao Wu <skyw@users.noreply.github.com>

copy-pr-bot bot deployed to test March 6, 2026 17:27 Active

copy-pr-bot bot temporarily deployed to nemo-ci March 6, 2026 17:27 Inactive

copy-pr-bot bot deployed to nemo-ci March 6, 2026 17:32 Active

copy-pr-bot bot temporarily deployed to nemo-ci March 6, 2026 17:32 Inactive
